Contributor

@kfaraz kfaraz commented Jul 21, 2025

This patch is based on the changes originally proposed in #18265.
Since that PR had a very long history of tuning the workflows (sincere thanks to @Akshat-Jain for the assistance!),
I have created a fresh PR instead to simplify reviews.

The old PR had only a couple of review comments which have already been answered.
#18265 (comment)
#18265 (comment)

Summary

  • Support running Druid Docker containers in embedded tests
  • Use the Druid distribution image for these containers rather than a test-only image as previously done by ITs
  • Add a CliEventCollector to monitor the containers and write efficient tests
  • Run new job docker-tests in GHA which runs a few Docker-based tests

How to run

From Druid root directory:

$ mvn package -Pskip-static-checks -Pskip-tests -Pdist -DskipUTs=true -Dmaven.javadoc.skip=true -T1.0C
$ DRUID_DIST_IMAGE_NAME=apache/druid:34.0.0-rc1 .github/scripts/run_docker-tests

(These commands run successfully for the ongoing release RC. Only 1 test, for the custom node role, fails, as it requires a minor change to the druid.sh script that is used to start Druid services in the containers.)

Main changes to review

  • DruidContainer
  • DruidCommand
  • DruidContainerResource
  • DockerTestBase
  • CliEventCollector

Important scripts

  • .github/workflows/docker-tests.yml
  • .github/scripts/run_docker-tests

New test classes

  • IngestionDockerTest
  • CustomNodeRoleDockerTest
  • IngestionBackwardCompatibilityDockerTest
  • HttpEmitterEventCollectorTest

Change details

  • Add extension druid-testcontainers that can be later published as an individual module for Testcontainers
  • Add DruidContainer - Testcontainer impl for running individual Druid services
  • Add DruidContainerResource to allow use of DruidContainer in embedded cluster tests
  • Make minor modifications to EmbeddedDruidServer and EmbeddedDruidCluster to support
    running both embedded servers and container-based services in the same cluster
  • Remove distribution-checks.yml job
  • Add a new job to run the Docker tests
  • Move CliCustomNodeRole to CliEventCollector, which is a custom Druid node with command eventCollector.
  • Remove ITHighAvailabilityTest and related files

Note: Usage of DruidContainer will not be the norm but the exception to test out backward compatibility
and/or Docker related changes only. All other use cases should continue to use EmbeddedDruidServers.

CliEventCollector

Both the standard ITs and revised ITs used a CliCustomNodeRole in ITHighAvailabilityTest.
This patch removes all the code related to these.
ITHighAvailabilityTest has already been migrated to the embedded version, HighAvailabilityTest.

The CliCustomNodeRole has been renamed to CliEventCollector and moved to testing-tools.
This class now serves a dual purpose:

  • Verify custom node role behaviour in CustomNodeRoleDockerTest
  • Run an event collector server which can be used in embedded tests.
    • The event collector receives metrics from other Druid services using the HttpPostEmitter.
    • It forwards the metrics to a LatchableEmitter.
    • Thus tests can wait on specific metric events being emitted from Druid services.
    • This is particularly useful while running Docker tests.
    • Using a Cli* provides a lot of wiring out of the box without having to write new code.
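
The waiting behaviour described above can be illustrated with a minimal sketch. It is not Druid's LatchableEmitter; the class name, the event shape and the metric name used in the usage note are assumptions made only for illustration. It shows the general idea of a collector handing events to an emitter on which a test thread can block until a matching event arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Simplified stand-in for a latchable emitter (hypothetical, not Druid's class). */
final class SketchLatchableEmitter
{
  /** Hypothetical event shape: emitting service, metric name and value. */
  public static final class MetricEvent
  {
    public final String service;
    public final String metric;
    public final long value;

    public MetricEvent(String service, String metric, long value)
    {
      this.service = service;
      this.metric = metric;
      this.value = value;
    }
  }

  private final List<MetricEvent> events = new ArrayList<>();

  /** Called by the collector for every event it receives over HTTP. */
  public synchronized void emit(MetricEvent event)
  {
    events.add(event);
    notifyAll(); // wake up any test thread waiting for a matching event
  }

  /** Blocks until a matching event has been emitted, or fails after the timeout. */
  public synchronized MetricEvent waitForEvent(Predicate<MetricEvent> condition, long timeoutMillis)
  {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      // Check events received so far, including those emitted before the wait started
      for (MetricEvent event : events) {
        if (condition.test(event)) {
          return event;
        }
      }
      final long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMillis <= 0) {
        throw new IllegalStateException("Timed out waiting for a matching event");
      }
      try {
        wait(remainingMillis);
      }
      catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for an event", e);
      }
    }
  }
}
```

A test would typically call something like waitForEvent(e -> e.metric.equals("task/run/time"), 5000) right after submitting a task, rather than polling or sleeping.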

Sample test run

https://github.com/apache/druid/actions/runs/16416105923/job/46388247532?pr=18302

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.druid.testing.embedded.docker.IngestionDockerTest
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 72.72 s -- in org.apache.druid.testing.embedded.docker.IngestionDockerTest
[INFO] Running org.apache.druid.testing.embedded.docker.IngestionBackwardCompatibilityDockerTest
[INFO] Running org.apache.druid.testing.embedded.docker.IngestionBackwardCompatibilityDockerTest$Apache31
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 82.56 s -- in org.apache.druid.testing.embedded.docker.IngestionBackwardCompatibilityDockerTest$Apache31
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 82.57 s -- in org.apache.druid.testing.embedded.docker.IngestionBackwardCompatibilityDockerTest
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0

Advantages

  • Can be used to run embedded tests against the distribution Docker image as opposed to the current ITs
    which use a custom IT-only image
  • Takes us 1 step closer to completely removing the old IT frameworks
  • Leverages all the benefits of the embedded test framework while also testing out the Docker containers
  • Can be easily used for backward compatibility tests

Next steps

  • As druid-testcontainers evolves, we can try to contribute it to the main Testcontainers repo
  • Add a Druid command to run a single process Druid cluster. This would be particularly beneficial for users of the Druid testcontainer.
  • Once we have phased out the old ITs, all integration tests will be based on EmbeddedClusterTestBase
  • Most of these tests will just use EmbeddedDruidServer (and required containers only, like KafkaResource)
  • These tests will use Indexer as worker as they are faster and lighter than MiddleManagers
  • There will be only a small number of tests that use DruidContainer, some of which will use MiddleManagers

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Akshat-Jain Akshat-Jain reopened this Jul 21, 2025
Member

@kgyrtkirk kgyrtkirk left a comment

left a few comments here and there...some of them might be considered personal preference/etc so not addressing all is ok as well :)

uses: ./.github/workflows/worker.yml
with:
script: .github/scripts/run_unit-tests -Dtest=!QTest,'${{ matrix.pattern }}' -Dmaven.test.failure.ignore=true
script: .github/scripts/run_unit-tests -Dtest=!QTest,!*DockerTest*,'${{ matrix.pattern }}' -Dmaven.test.failure.ignore=true
Member

I'm guilty of not moving the QTest into the validation phase, which seems to have given rise to this pattern, which could be considered bad practice.

This is not really serious...it could be fixed separately - but it nicely outlines your requirement that you just wanted to add 1 more test....

Contributor Author

Fixed. Please check the latest comments.

Comment on lines 51 to 52
source .github/scripts/distribution_checks_env.sh
.github/scripts/run_unit-tests -Dtest=*DockerTest* -Ddruid.testing.docker.image=$DRUID_DIST_IMAGE_NAME
Member

instead of trying to reuse the run_unit-tests script, you could possibly make a separate one for this, and run that (you could still invoke run_unit-tests from there if you want)

you should preferably also use the worker.yml to run it....that way if you want to test the workflow you could just try to run that script on your own machine!

Contributor Author

instead of trying to reuse the run_unit-tests script, you could possibly make a separate one for this, and run that (you could still invoke run_unit-tests from there if you want)

Could you please elaborate? What would be the advantage of putting this in a separate script if we would still invoke run_unit-tests from there?

Member

you could run it locally to reproduce what's happening on the CI; you could call mvn directly; but you will also need to build the image...

Contributor Author

Added script run_docker-tests. To run all Docker tests locally,

$ DRUID_DIST_IMAGE_NAME=apache/druid:34.0.0-rc1 .github/scripts/run_docker-tests

This doesn't include the image building yet though.

Comment on lines +282 to +291
private static void createLogDirectory(File dir)
{
  try {
    FileUtils.mkdirp(dir);
    Files.setPosixFilePermissions(dir.toPath(), PosixFilePermissions.fromString("rwxrwxrwx"));
  }
  catch (Exception e) {
    throw new RuntimeException(e);
  }
}
Member

couldn't the MountedDir provide a place for this method? that way if someone needs something similar in the future it might get reused...

Contributor Author

Updated, thanks for the suggestion!

Comment on lines 67 to 72
  return new DruidContainerResource(DruidCommand.INDEXER)
      .addProperty("druid.lookup.enableLookupSyncOnStartup", "false")
      .addProperty("druid.processing.buffer.sizeBytes", "50MiB")
      .addProperty("druid.processing.numMergeBuffers", "2")
      .addProperty("druid.processing.numThreads", "5");
}
Member

feels like the DruidCommand was over and underused at the same time...

  • it contains some stuff so new DruidContainerResource(DruidCommand.INDEXER) could create a valid object
  • it's underused as a good one will have these 4 properties set...

feels like there should be a DruidOverlordContainer / DruidIndexerContainer and similar....it could still be an innerclass here...

Contributor Author

For now, I have moved all required properties to DruidCommand so that all users of DruidContainer can get those default properties out of the box.
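
As a rough sketch of that idea (not the actual DruidCommand or DruidContainerResource code; all class and method names below are hypothetical), a command can carry its own default properties while the resource only layers explicit overrides on top. The four property values are the ones from the snippet under review; the empty defaults for the second command are a placeholder.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a container resource that merges command defaults with overrides. */
final class SketchContainerResource
{
  /** Hypothetical stand-in for DruidCommand: each command owns its default properties. */
  public enum SketchCommand
  {
    INDEXER(
        Map.of(
            "druid.lookup.enableLookupSyncOnStartup", "false",
            "druid.processing.buffer.sizeBytes", "50MiB",
            "druid.processing.numMergeBuffers", "2",
            "druid.processing.numThreads", "5"
        )
    ),
    // Placeholder only: real defaults for other commands are not covered here
    OVERLORD(Map.of());

    private final Map<String, String> defaultProperties;

    SketchCommand(Map<String, String> defaultProperties)
    {
      this.defaultProperties = defaultProperties;
    }

    public Map<String, String> getDefaultProperties()
    {
      return defaultProperties;
    }
  }

  private final SketchCommand command;
  private final Map<String, String> overrides = new TreeMap<>();

  public SketchContainerResource(SketchCommand command)
  {
    this.command = command;
  }

  /** Adds or overrides a single property; returns this for chaining. */
  public SketchContainerResource addProperty(String key, String value)
  {
    overrides.put(key, value);
    return this;
  }

  /** Defaults of the command first, then explicit overrides on top. */
  public Map<String, String> effectiveProperties()
  {
    final Map<String, String> merged = new TreeMap<>(command.getDefaultProperties());
    merged.putAll(overrides);
    return merged;
  }
}
```

With this shape, new SketchContainerResource(SketchCommand.INDEXER) is usable as-is, and a test only calls addProperty for values it wants to change.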

Contributor Author

kfaraz commented Jul 28, 2025

Thanks for the suggestions, @kgyrtkirk !

I have made the following changes:

  • Moved all required properties to DruidCommand so that all users of DruidContainer can benefit from the default set of properties.
  • Removed distribution-checks.yml workflow and distribution_checks_env.sh as you suggested
  • Used File instead of String where applicable
  • Used path /druid/deep-store on container

I am not very clear on how to use the worker.yml/run_unit-tests script in the new workflow.
I am open to suggestions.

public class IngestionDockerTest extends EmbeddedClusterTestBase
{
  static {
    System.setProperty("druid.testing.docker.image", "apache/druid:35.0.0-SNAPSHOT");
Contributor

@gianm gianm Jul 29, 2025

Setting this in a static initializer doesn't seem right, because it will be set when the class is loaded, not necessarily when the test runs. Maybe better to use @BeforeAll.

Although, it'd be even better to not have to set the property at all. Like, perhaps the property just sets a default, but tests are able to specify their own when constructing the DruidContainer. That's how it works for other kinds of testcontainers.

Also- about the value- do we need to change the value when Druid's version is upgraded? Is it possible to get it programmatically, such as from DruidContainerResource.class.getPackage().getImplementationVersion()? Should work when running from a jar, possibly not from an IDE.

Finally, if we are going to keep needing to set this property, better to refer to the name by DruidContainerResource.PROPERTY_TEST_IMAGE rather than hardcoding it.

Contributor Author

Setting this in a static initializer doesn't seem right, because it will be set when the class is loaded, not necessarily when the test runs. Maybe better to use @BeforeAll.

Sorry, this was not meant to be committed and was for my local testing only.
Must have crept in in a recent commit.

In CI workflows, this property is always passed as a Java argument -Ddruid.testing.docker.image.
So, it needs to be set only when running the test locally.

I chose purposefully not to use a default image so that each test has to set it explicitly.
KafkaContainer seems to have deprecated the no-arg constructor too, so that test writers explicitly call out the required image name.

Contributor Author

For local testing, what do you think would be the best way to specify this property?

Contributor

I think it's fine to require people to set the system property for local testing. Some thought would be needed on part of the person running the tests about what image they want to use. If they want to test against their current branch, they'll need to build a Docker image. Maybe they don't care about that; maybe they just want to test the Docker embedded test itself, in which case it would be fine to run against a released Apache image.

The error message when the property isn't set should spell this out. It should mention that the property is typically set during CI, but isn't set outside CI because we don't know what image you want to use.
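
A minimal sketch of the lookup order discussed in this thread, assuming a hypothetical helper class: an image passed explicitly by the test wins, otherwise the system property is used, and a missing value fails with a message that explains why no default exists. The property name comes from the discussion; the helper, its method and the message wording are illustrative only and are not the code in DruidContainerResource.

```java
/** Hypothetical helper that resolves which Docker image a test should use. */
final class SketchImageResolver
{
  /** Property name as used in this thread. */
  public static final String PROPERTY_TEST_IMAGE = "druid.testing.docker.image";

  /**
   * Returns the explicit image if given, else the value of the system property.
   * Fails with an explanatory message when neither is available.
   */
  public static String resolveImage(String explicitImage)
  {
    if (explicitImage != null && !explicitImage.trim().isEmpty()) {
      return explicitImage;
    }

    final String fromProperty = System.getProperty(PROPERTY_TEST_IMAGE);
    if (fromProperty == null || fromProperty.trim().isEmpty()) {
      throw new IllegalStateException(
          "System property '" + PROPERTY_TEST_IMAGE + "' is not set. It is typically set in CI,"
          + " but has no default for local runs since the image to test against depends on your intent."
          + " Build an image from your branch or use a released image, and pass it with -D"
          + PROPERTY_TEST_IMAGE + "=<image>."
      );
    }
    return fromProperty;
  }
}
```

Keeping the failure in one place means every Docker-based test reports the same guidance when it is run locally without the property.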

Contributor

Some thought also needs to be given to how committers should run mvn test when they test release builds as part of release voting. Do they need to build a Docker image first in order to run this test? Do they need to specify -Ddruid.testing.docker.image to mvn? Or is it automatic as part of the mvn workflow? I am curious what you think about these items.

Contributor Author

Thanks for bringing this up, @gianm !

For release voting, I feel it would be best for committers to download the image built as part of the RC and then run mvn test -Ddruid.testing.docker.image=apache/druid:rc1 -Dtest=*DockerTest*.
Since the release voting process is not automated anyway (perhaps by design so that voters are aware of the steps they go through), it is probably okay to require specifying -Ddruid.testing.docker.image explicitly.

We can probably add these instructions to release guidelines (and the vote mail) so that users can always copy paste the commands.

What are your thoughts?

Contributor Author

The error message when the property isn't set should spell this out. It should mention that the property is typically set during CI, but isn't set outside CI because we don't know what image you want to use.

Updated the message in DruidContainerResource.

Contributor

I wonder if the test should be disabled by default, and enabled using a mvn profile. The case for disabling it by default is that it isn't actually testing the current code in the repository. It's testing an image pulled from somewhere else. Having it be its own command makes it clearer what is happening. During voting, if committers want to test the Docker image too, that can be done with a special mvn command separate from the typical test command.

How does this sound to you?

Contributor Author

Thanks for the suggestion!
I agree that this would be the cleanest approach.

@Akshat-Jain Akshat-Jain reopened this Jul 29, 2025
@Akshat-Jain Akshat-Jain reopened this Jul 29, 2025
Contributor Author

kfaraz commented Jul 31, 2025

@gianm, I also needed to include some other changes, which provide visibility into the containers and are very useful in writing these Docker-based tests.

  • Converted a CliCustomNodeRole to a CliEventCollector
  • This is a custom test-only Druid service that just collects events over HTTP.
  • The events are then forwarded to a LatchableEmitter which allows watching for certain events.
  • Added a CustomNodeRoleDockerTest which supersedes the ITHighAvailabilityTest
  • Removed code related to the old high availability tests.

I hope these extra changes have not made the PR too bloated.

@kfaraz kfaraz requested review from gianm and kgyrtkirk July 31, 2025 14:28
Contributor

@capistrant capistrant left a comment

Looking good! few nit comments here and there. My biggest question is the one on container friendly embedded hostnames and why it is not just the default way services are setting up their hostname

@Override
public EmbeddedDruidCluster createCluster()
{
  coordinator.usingImage(DruidContainer.Image.APACHE_31);
Contributor

is this something someone is going to have to periodically update and commit to master?

Contributor Author

No, I think we can keep testing against 31 until we decide to make some major change which is not backward compatible with 31 anymore.

I didn't try using an image older than 31, but we could probably go as far back as 27 and still be backward compatible (at least for the things that are being verified in these tests).

Contributor

I think this test will get skipped by default since it has DockerTest in the name. IMO, it's sensible to always run this particular test. The main reason I thought that Docker tests in general shouldn't all run is that they're mostly intended to run against the "current code". They don't serve their purpose unless the person running the tests actually creates an image matching the current checked out repo. This needs to be done manually so it makes sense that it's a separate command.

But, the backwards compatibility test isn't intended to deploy the "current code". It wants to deploy a specific older version. So, it doesn't have that problem, and it's therefore good to always run it.

Btw, IMO it's nicer to use annotations rather than test names to control what tests run in a standard mvn run. The downside of using names is that it's indirect. It makes it less obvious which tests run and which don't, and also means that if you want to change whether a test runs, you have to rename it.

Contributor Author

@kfaraz kfaraz Aug 7, 2025

Btw, IMO it's nicer to use annotations rather than test names to control what tests run in a standard mvn run. The downside of using names is that it's indirect. It makes it less obvious which tests run and which don't, and also means that if you want to change whether a test runs, you have to rename it.

I agree. In the current code, I am using the @Tag("docker-tests") annotation (applied on DockerTestBase) to skip the Docker-based tests.
The *DockerTest* class name is to tell mvn verify to look for these classes while searching for eligible integration tests. Otherwise, it only looks for *IT or IT* classes by default.
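
Why a single annotation on the base class is enough can be shown with a small, self-contained sketch. It does not use JUnit: the annotation, the classes and the filter below are hypothetical stand-ins. The mechanism it relies on is Java's @Inherited meta-annotation, which JUnit 5's @Tag also carries, so a tag placed on a base class is visible on every subclass.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of tag-based test filtering via an inherited annotation. */
final class SketchTagFilter
{
  /** Stand-in for a test tag annotation; @Inherited makes it visible on subclasses. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @Inherited
  public @interface SketchTag
  {
    String value();
  }

  /** Stand-in for a tagged base class such as DockerTestBase. */
  @SketchTag("docker-tests")
  public abstract static class SketchDockerTestBase
  {
  }

  /** Picks up the tag purely by extending the base class. */
  public static class SketchIngestionDockerTest extends SketchDockerTestBase
  {
  }

  /** An untagged test class that should always be selected. */
  public static class SketchPlainEmbeddedTest
  {
  }

  /** Returns the candidate classes that do not carry the excluded tag. */
  public static List<Class<?>> excludeTagged(List<Class<?>> candidates, String excludedTag)
  {
    final List<Class<?>> selected = new ArrayList<>();
    for (Class<?> candidate : candidates) {
      // getAnnotation() also finds @Inherited annotations declared on superclasses
      final SketchTag tag = candidate.getAnnotation(SketchTag.class);
      if (tag == null || !tag.value().equals(excludedTag)) {
        selected.add(candidate);
      }
    }
    return selected;
  }
}
```

Changing whether a test runs by default then means changing its base class or annotation, not its name; the class-name pattern only affects which classes the build considers at all.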

Contributor Author

@kfaraz kfaraz Aug 7, 2025

But, the backwards compatibility test isn't intended to deploy the "current code". It wants to deploy a specific older version. So, it doesn't have that problem, and it's therefore good to always run it.

Thanks for the suggestion!

In that case, I will remove the IngestionBackwardCompatibilityDockerTest from the DockerTestBase hierarchy. I will also update it to use only Druid 31 containers and embedded servers (currently it uses a mix of Druid 31 containers, current-code Druid containers and embedded servers).

Contributor

I agree. In the current code, I am using the @Tag("docker-tests") annotation (applied on DockerTestBase) to skip the Docker-based tests.
The *DockerTest* class name is to tell mvn verify to look for these classes while searching for eligible integration tests. Otherwise, it only looks for *IT or IT* classes by default.

Ah, I see. I missed that when reading the maven stuff.

 * Uses a container-friendly hostname for all embedded services, Druid as well
 * as external.
 */
public EmbeddedDruidCluster useContainerFriendlyHostname()
Contributor

what is the reason of having clusters opt into containerFriendlyHostname? Is there a downside or problem of using it if you don't have to, say for an existing embedded test you have written, HighAvailability?

Contributor Author

Fair point!

All the existing embedded tests would still work if we always used the containerFriendly hostname, since the Docker tests already use embedded services and they are able to connect to each other seamlessly.
In fact, that's how it used to be (except we were using InetAddress.getLocalHost().getCanonicalHostName() instead of InetAddress.getLocalHost().getHostAddress()),
but we changed it in #18228 as there were some apprehensions with using the canonical host name,
ref https://github.com/apache/druid/pull/18228/files#r2197340379.

But now that we are using the getHostAddress() which is simply the IP address, I think we can just stick to using the containerFriendly one all the time.

cc: @clintropolis
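
The difference between the two InetAddress calls mentioned above can be seen with a small sketch. The class and method names are hypothetical, and the loopback fallback is an assumption added only so the example runs on machines where the local host name does not resolve; it is not something the patch is known to do.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper contrasting the host address with the canonical host name. */
final class SketchHostnames
{
  /** Returns the textual IP address of the local host, e.g. for use by containers. */
  public static String containerFriendlyHost()
  {
    try {
      return InetAddress.getLocalHost().getHostAddress();
    }
    catch (UnknownHostException e) {
      // Assumption for this sketch only: fall back to the loopback address
      return InetAddress.getLoopbackAddress().getHostAddress();
    }
  }

  public static void main(String[] args) throws UnknownHostException
  {
    final InetAddress local = InetAddress.getLocalHost();
    // An IP address: no name lookup is needed by whoever connects to it
    System.out.println("host address:        " + local.getHostAddress());
    // A fully qualified name: may involve a reverse lookup and varies by environment
    System.out.println("canonical host name: " + local.getCanonicalHostName());
  }
}
```

Running main on a developer machine and inside CI usually prints different canonical names, while the host address is always a plain IP that a container can connect to without name resolution.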

Contributor

From an ease of use perspective I think it would be awesome to get to a point where we default to container friendly so people who are writing new tests don't have to decide/realize their test requires it. But I don't think it is blocking for this PR

Contributor Author

The current default (i.e. localhost) should already be good enough for all (non container) embedded tests.

For Druid container based tests, users writing new tests need not be conscious of the choice as long as they extend DockerTestBase.
Also, since Druid containers are not meant to be used frequently except in a couple of smoke tests, the extra step is probably okay?

Contributor

ya, I agree. I'm probably just overthinking it since I'm currently working on the KafkaDataFormat tests which requires an extra docker container that can talk to the kafka container :)

Contributor Author

kfaraz commented Aug 6, 2025

Thanks for the suggestions, @capistrant ! I have updated the javadocs as suggested and replied to your other comments.
Please let me know what you think.

Contributor

@capistrant capistrant left a comment

I'm cool with current state of this. Will defer to you @kfaraz if you want to get final word from Gian and/or Zoltan on your changes after their review before you merge

Contributor

@gianm gianm left a comment

Approving, since I have one comment remaining but I don't consider it blocking: #18302 (comment)

@Akshat-Jain Akshat-Jain closed this Aug 7, 2025
@Akshat-Jain Akshat-Jain reopened this Aug 7, 2025
@Akshat-Jain Akshat-Jain closed this Aug 7, 2025
@Akshat-Jain Akshat-Jain reopened this Aug 7, 2025
@Akshat-Jain Akshat-Jain closed this Aug 7, 2025
@Akshat-Jain Akshat-Jain reopened this Aug 7, 2025
@kfaraz kfaraz merged commit 33c38af into apache:master Aug 7, 2025
94 of 95 checks passed
@kfaraz kfaraz deleted the add_druid_container branch August 7, 2025 16:48
Contributor Author

kfaraz commented Aug 7, 2025

Thanks for the reviews, @kgyrtkirk , @capistrant , @gianm !
Thanks for the assist with the CI, @Akshat-Jain !

@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025